Hash Tables
Direct-address tables
- An array, or direct-address table, denoted by
, in which each position, or slot, corresponds to a key in the universe U . - If the set contains no element with key
, then . - Direct addressing is a simple technique that works well when the
universe
of keys is reasonably small Each of these operations takes only1
2
3
4
5
6DIRECT-ADDRESS-SEARCH(T, k)
1 return T[k]
DIRECT-ADDRESS-INSERT(T, x)
1 T[x.key] =x
DIRECT-ADDRESS-DELETE(T, x)
1 T[x.key] = NILtime.
Hash tables
- Specifically, we can reduce the storage requirement to ‚
while we maintain the benefit that searching for an element in the hash table still requires only time. The catch is that this bound is for the average-case time, whereas for direct addressing it holds for the worst-case time. - we use a hash function
to compute the slot from the key . Here, maps the universe of keys into the slots of a hash table - We say that an element with key
hashes to slot ; we also say that is the hash value of key . - There is one hitch: two keys may hash to the same slot. We call this situation a collision
- Of course, the ideal solution would be to avoid collisions altogether. We might try to achieve this goal by choosing a suitable hash function h. One idea is to make h appear to be “random,” thus avoiding collisions or at least minimizing their number.
- Because
, however, there must be at least two keys that have the same hash value; avoiding collisions altogether is therefore impossible.
Collision resolution by chaining
- In chaining, we place all the elements that hash to the same slot into the same linked list
1 | CHAINED-HASH-INSERT(T, x) |
- We can delete an element in
time if the lists are doubly linked.If the hash table supports deletion, then its linked lists should be doubly linked so that we can delete an item quickly. - With singly linked lists, both deletion and searching would have the same asymptotic running times.
Analysis of hashing with chaining
- Given a hash table
with slots that stores elements, we define the load factor , for as , that is, the average number of elements stored in a chain. - The worst-case behavior of hashing with chaining is terrible: all
keys hash to the same slot, creating a list of length .,no better than if we used one linked list for all the elements - we shall assume that any given element is equally likely to hash into any of the m slots, independently of where any other element has hashed to. We call this the assumption of simple uniform hashing.
- For
, let us denote the length of the list by , so that ;and the expected value of is - Theorem 11.1:In a hash table in which collisions are resolved by
chaining, an unsuccessful search takes average-case time
, under the assumption of simple uniform hashing. - Theorem 11.2: In a hash table in which collisions are resolved by
chaining, a successful search takes average-case time
, under the assumption of simple uniform hashing.
Proof: The number of elements examined during a successful search for
an element x is one more than the number of elements that appear before
Let
Hash functions
What makes a good hash function?
- A good hash function satisfies (approximately) the assumption of simple uniform hashing
- A good approach derives the hash value in a way that we expect to be independent of any patterns that might exist in the data
- we note that some applications of hash functions might require stronger properties than are provided by simple uniform hashing
Interpreting keys as natural numbers
- Most hash functions assume that the universe of keys is the set
of natural numbers.
The division method
- the hash function is
- A prime not too close to an exact power of 2 is often a good choice
for
.
The multiplication method
- The multiplication method for creating hash
functions operates in two steps.
- First, we multiply the key
by a constant in the range and extract the fractional part of . - Then, we multiply this value by m and take the floor of the result.
In short, the hash function is
- First, we multiply the key
- An advantage of the multiplication method is that the value of m is not critical
- The optimal choice depends on the characteristics of the data being
hashed. Knuth
suggests that
Universal hashing
- choose the hash function randomly in a way that is independent of the keys that are actually going to be stored. This approach, called universal hashing
- Let
be a finite collection of hash functions that map a given universe of keys into the range . Such a collection is said to be universal if for each pair of distinct keys , the number of hash functions for which is at most . - Theorem 11.3:Suppose that a hash function h is chosen randomly from
a universal collection of hash functions and has been used to hash n
keys into a table T of size m, using chaining to resolve collisions. If
key k is not in the table, then the expected length
of the list that key hashes to is at most the load factor .If key is in the table, then the expected length of the list containing key is at most - Corollary 11.4:Using universal hashing and collision resolution by
chaining in an initially empty table with
slots, it takes expected time to handle any sequence of n INSERT, SEARCH, and DELETE operations containing INSERT operations.
Designing a universal class of hash functions
- Theorem 11.5:The class
of hash functions is universal.
Open addressing
- In open addressing, all elements occupy the hash table itself. That is, each table entry contains either an element of the dynamic set or NIL.
1 | HASH-INSERT(T,k) |
1 | HASH-SEARCH(T,k) |
- Deletion from an open-address hash table is difficult,We can solve this problem by marking the slot, storing in it the special value DELETED instead of NIL.We would then modify the procedure HASH-INSERT to treat such a slot as if it were empty so that we can insert a new key there. We do not need to modify HASH-SEARCH, since it will pass over DELETED values while searching.
- we assume uniform hashing: the probe sequence of
each key is equally likely to be any of the
permutations of .
techniques to compute the probe sequences
Linear probing
Given an ordinary hash function
Linear probing is easy to implement, but it suffers from a problem known as primary clustering
Quadratic probing
Quadratic probing uses a hash function of the form
Double hashing
Double hashing uses a hash function of the form